143 research outputs found
Hybrid LSH: Faster Near Neighbors Reporting in High-dimensional Space
We study the r-near neighbors reporting problem (r-NN), i.e., reporting
all points in a high-dimensional point set that lie within a radius r
of a given query point q. Our approach builds on the
locality-sensitive hashing (LSH) framework due to its appealing asymptotic
sublinear query time for near neighbor search problems in high-dimensional
space. A bottleneck of the traditional LSH scheme for solving r-NN is that
its performance is sensitive to data and query-dependent parameters. On
datasets whose data distributions have diverse local density patterns, LSH with
inappropriate tuning parameters can sometimes be outperformed by a simple
linear search.
In this paper, we introduce a hybrid search strategy between LSH-based search
and linear search for r-NN in high-dimensional space. By integrating an
auxiliary data structure into LSH hash tables, we can efficiently estimate the
computational cost of LSH-based search for a given query regardless of the data
distribution. This means that we are able to choose the appropriate search
strategy between LSH-based search and linear search to achieve better
performance. Moreover, the integrated data structure is time efficient and fits
well with many recent state-of-the-art LSH-based approaches. Our experiments on
real-world datasets show that the hybrid search approach outperforms (or is
comparable to) both LSH-based search and linear search for a wide range of
search radii and data distributions in high-dimensional space.
Comment: Accepted as a short paper in EDBT 201
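The hybrid decision described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not the paper's method: the class name `HybridLSH`, the bucketed random-projection hash, and the cost model, which simply sums the sizes of the buckets the query falls into and switches to a linear scan once that sum reaches the dataset size. The abstract's auxiliary data structure is not reproduced.

```python
import math
import random
from collections import defaultdict


class HybridLSH:
    """Toy LSH index that falls back to a linear scan for costly queries.

    Hypothetical cost model (an assumption, not taken from the paper):
    the LSH cost is the total number of entries in the query's buckets.
    """

    def __init__(self, points, num_tables=8, hashes_per_table=4, width=4.0, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.points = points
        self.width = width
        # One list of (random direction, random offset) pairs per hash table.
        self.funcs = [
            [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
             for _ in range(hashes_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        for idx, p in enumerate(points):
            for table, funcs in zip(self.tables, self.funcs):
                table[self._key(funcs, p)].append(idx)

    def _key(self, funcs, p):
        # Bucketed random projection: floor((a . p + b) / width) per hash.
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, p)) + b) / self.width)
            for a, b in funcs
        )

    def query(self, q, r):
        buckets = [t.get(self._key(f, q), []) for t, f in zip(self.tables, self.funcs)]
        estimated_cost = sum(len(b) for b in buckets)
        if estimated_cost >= len(self.points):
            strategy, candidates = "linear", range(len(self.points))
        else:
            strategy, candidates = "lsh", set().union(*buckets)
        # Every candidate is verified, so no false positives are reported.
        hits = sorted(i for i in candidates if math.dist(self.points[i], q) <= r)
        return strategy, hits
```

In a dense region the buckets are large, so the sketch picks the linear scan; in a sparse region it inspects only the colliding points. The LSH branch can miss true neighbors, as with any LSH scheme.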
Press freedom and reporting on the government in Myanmar
Professional project report submitted in partial fulfillment of the requirements for the degree of Master of Arts in Journalism from the School of Journalism, University of Missouri--Columbia. This project investigates the state of press freedom in Myanmar by comparing reporting on the government in Myanmar before and after the government lifted central censorship of print media in August 2012. Eleven semi-structured interviews were conducted. The subjects are twelve experienced journalists who have been covering Myanmar for years. The results show a significant change in working conditions as well as in attitudes towards reporting in Myanmar. While skepticism about a totally free press remains, many journalists are optimistic about the future. Includes bibliographic references.
A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data
Outlier mining in d-dimensional point sets is a fundamental and well-studied data mining task due to its variety of applications. Most such applications arise in high-dimensional domains. A bottleneck of existing approaches is that implicit or explicit assessments of concepts such as distance or nearest neighbor deteriorate in high-dimensional data. Following up on the work of Kriegel et al. (KDD '08), we investigate the use of the angle-based outlier factor in mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic-time heuristic), we propose a novel random projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Our approach is also suitable for parallel environments, where it achieves a parallel speedup. We introduce a theoretical analysis of the quality of approximation to guarantee the reliability of our estimation algorithm. The empirical experiments on synthetic and real-world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets.
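To make the idea of an angle-based outlier score concrete, here is a brute-force sketch. It assumes a simplified, unweighted variant: for each point, the variance of the angles it forms with all pairs of other points. This is the cubic-time baseline idea only, not the near-linear random-projection estimator proposed in the abstract, and the function name `angle_variance_scores` is hypothetical.

```python
import math
from itertools import combinations


def angle_variance_scores(points):
    """Return, for each point, the variance of angles to all pairs of others.

    Simplified, unweighted score (an assumption for illustration).
    A low variance suggests an outlier: seen from far away, all other
    points lie in nearly the same direction.
    """
    scores = []
    for i, p in enumerate(points):
        # Difference vectors from p to every other point; skip duplicates of p.
        diffs = [[x - y for x, y in zip(o, p)] for j, o in enumerate(points) if j != i]
        diffs = [d for d in diffs if any(d)]
        angles = []
        for u, v in combinations(diffs, 2):
            cos = sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
            # Clamp to guard against rounding slightly outside [-1, 1].
            angles.append(math.acos(max(-1.0, min(1.0, cos))))
        if not angles:
            scores.append(0.0)  # fewer than two usable neighbors
            continue
        mean = sum(angles) / len(angles)
        scores.append(sum((a - mean) ** 2 for a in angles) / len(angles))
    return scores
```

The triple loop over a point and all pairs of other points is what makes this baseline cubic in the number of points.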
Revisiting Wedge Sampling for Budgeted Maximum Inner Product Search
Top-k maximum inner product search (MIPS) is a central task in many machine
learning applications. This paper extends top-k MIPS to a budgeted setting
that asks for the best approximate top-k MIPS given a limit of B computational
operations. We investigate recent advanced sampling algorithms, including wedge
and diamond sampling, to solve it. Though the design of these sampling schemes
naturally supports budgeted top-k MIPS, they suffer from the linear cost of
scanning all data points to retrieve top-k results and from performance
degradation when handling negative inputs.
This paper makes two main contributions. First, we show that diamond sampling
is essentially a combination of wedge sampling and basic sampling for
top-k MIPS. Our theoretical analysis and empirical evaluation show that wedge
is competitive with (and often superior to) diamond on approximating top-k MIPS
regarding both efficiency and accuracy. Second, we propose a series of
algorithmic engineering techniques to deploy wedge sampling on budgeted top-k
MIPS. Our novel deterministic wedge-based algorithm runs significantly faster
than the state-of-the-art methods for budgeted and exact top-k MIPS while
maintaining top-5 precision of at least 80% on standard recommender system
data sets.
Comment: ECML-PKDD 202
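For orientation, the basic randomized wedge sampling scheme that the abstract builds on can be sketched as follows. This is not the paper's deterministic variant. It assumes non-negative data and a non-negative query with at least one positive inner product, and the names `wedge_mips`, `num_samples`, and `num_candidates` (a hypothetical way of splitting the budget between sampling and exact re-ranking) are illustrative only.

```python
import random
from collections import Counter


def wedge_mips(data, query, k, num_samples, num_candidates, seed=0):
    """Approximate top-k inner product search by basic wedge sampling.

    Assumes all entries of `data` and `query` are non-negative.
    A point i is sampled with probability proportional to <data[i], query>.
    """
    rng = random.Random(seed)
    n, d = len(data), len(query)
    col_sums = [sum(row[j] for row in data) for j in range(d)]
    # Step 1: pick a dimension j with probability proportional to q_j * c_j.
    dim_weights = [query[j] * col_sums[j] for j in range(d)]
    dims = rng.choices(range(d), weights=dim_weights, k=num_samples)
    counts = Counter()
    for j, hits in Counter(dims).items():
        # Step 2: within dimension j, pick a point i proportional to x_ij.
        column = [row[j] for row in data]
        counts.update(rng.choices(range(n), weights=column, k=hits))
    # Step 3: re-rank the most frequently sampled points by exact inner product.
    candidates = [i for i, _ in counts.most_common(num_candidates)]

    def exact(i):
        return sum(x * y for x, y in zip(data[i], query))

    return sorted(candidates, key=exact, reverse=True)[:k]
```

Only `num_candidates` exact inner products are computed, so the cost is governed by the sampling and re-ranking budget instead of a full scan for the final ranking.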
Kinetics and mechanism of various iron transformations in natural waters at circumneutral pH.
In this thesis, the implementation and results of studies into the effect of pH on the kinetics of various iron transformations in natural waters are described. Specific studies include i) the oxidation of Fe(II) in the absence and presence of both model and natural organic ligands, ii) the complexation of Fe(III) by model organic compounds, and iii) the precipitation of Fe(III) through the use of both laboratory investigations of iron species and kinetic modeling.
In the absence of organic ligands, oxidation of nanomolar concentrations of Fe(II) over the pH range 6.0 -- 8.0 is predominantly controlled by the reaction of Fe(II) with oxygen and with superoxide, while the disproportionation of superoxide appears to be negligible. Oxidation of Fe(II) by hydrogen peroxide, back reduction of Fe(III) by superoxide and precipitation of Fe(III) have been shown to exert some influence at various stages of the oxidation at different pH and initial Fe(II) concentrations. In the presence of organic ligands, different effects on the Fe(II) oxidation kinetics are shown with different organic ligands, their initial concentrations and varying pH. A detailed kinetic model is developed and shown to adequately describe the kinetics of Fe(II) oxidation in the absence and presence of various ligands over a range of concentrations and pH. The applicability of previous oxidation models to describe the experimental data is assessed.
Rate constants for the complexation of Fe(III) by a range of model organic compounds over the pH range 6.0 -- 9.5 are determined. Variation of rate constants for Fe(III) complexation by desferrioxamine B and ethylenediaminetetraacetate with varying pH is explained by an outer-sphere complexation model. The significant variation in rate constants of Fe(III) complexation by salicylate, 5-sulfosalicylate, citrate and 3,4-dihydroxybenzoate with varying pH is possibly due to the presence of different complexes at different pH. The results of this study demonstrate that organic ligands from different sources may influence the speciation of iron in vastly different ways.
The kinetics of Fe(III) precipitation are investigated in bicarbonate solutions over the pH range 6.0 -- 9.5. The rate of precipitation varies by nearly two orders of magnitude, with a maximum rate constant at a pH of around 8.0. The results of the study support the existence of the dissolved neutral species Fe(OH)₃⁰ and suggest that it is the dominant precursor in Fe(III) polymerization and subsequent precipitation at circumneutral pH. Variation in the precipitation rate constant over the pH range considered is consistent with a mechanism in which the kinetics of iron precipitation are controlled by rates of water exchange in dissolved iron hydrolysis species.
I/O-Efficient Similarity Join
We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: whereas the time complexity of classical algorithms includes a factor of N^rho, where rho is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor of (N/M)^rho, where N is the data size and M is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will also be useful in other computational settings such as parallel computation.
Effects of ASE noise and dispersion chromatic on performance of DWDM networks using distributed Raman amplifiers
We investigate the effects of amplified spontaneous emission (ASE) noise, noise figure (NF) and chromatic dispersion on the performance of DWDM networks using distributed optical fiber Raman amplifiers (DRAs) in two different pump configurations, i.e., forward and backward pumping. We found that the pumping configuration, ASE noise, and dispersion play an important role in network performance, since they affect the noise figure and bit error rate (BER) of the system. Simulation results show that the lowest bit error rate and noise figure are obtained with the forward pumping configuration. Moreover, we have also compared the ASE noise powers from the simulation with those from the experiment, and they match.